A Consistent and Efficient Estimator for Data-Oriented Parsing

Authors

  • Andreas Zollmann
  • Khalil Sima'an
Abstract

Given a sequence of samples from an unknown probability distribution, a statistical estimator aims to approximate the distribution using statistics from the samples. One crucial property of a ‘good’ estimator is that its guess approaches the unknown distribution as the sample sequence grows large. This property is called consistency. This paper concerns estimators for natural language parsing under the Data-Oriented Parsing (DOP) model. The DOP model specifies how a probabilistic grammar is acquired from statistics over a given training treebank, a corpus of sentence-parse pairs. Recently, Johnson [15] showed that the DOP estimator (called DOP1) is biased and inconsistent. A second relevant problem with DOP1 is that it suffers from an overwhelming computational inefficiency. This paper presents the first (nontrivial) consistent estimator for the DOP model. The new estimator is based on a combination of held-out estimation and a bias toward parsing with shorter derivations. To justify the need for a biased estimator in the case of DOP, we prove that every non-overfitting DOP estimator is statistically biased. Our choice of a bias toward shorter derivations is justified by empirical experience, mathematical convenience and efficiency considerations. In support of our theoretical results on consistency and computational efficiency, we also report experimental results with the new estimator.
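
The notion of consistency described in the abstract can be illustrated with a toy example. The sketch below is not the paper's DOP estimator: it is a plain relative-frequency estimator over a small, hypothetical three-outcome distribution (`TRUE_DIST`), with total variation distance chosen here as one possible way to measure how far the guess is from the truth. All names in it are assumptions made for illustration.

```python
import random
from collections import Counter

# Hypothetical toy distribution; it does not come from the paper.
TRUE_DIST = {"a": 0.5, "b": 0.3, "c": 0.2}


def relative_frequency_estimate(samples):
    """Estimate a distribution by the relative frequency of each outcome."""
    counts = Counter(samples)
    n = len(samples)
    return {outcome: count / n for outcome, count in counts.items()}


def total_variation(p, q):
    """Total variation distance between two distributions given as dicts."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)


def draw(n, rng):
    """Draw n independent samples from TRUE_DIST using the given generator."""
    outcomes = list(TRUE_DIST)
    weights = [TRUE_DIST[x] for x in outcomes]
    return rng.choices(outcomes, weights=weights, k=n)


def error_at(n, seed=0):
    """Distance between the estimate from n samples and the true distribution."""
    rng = random.Random(seed)
    estimate = relative_frequency_estimate(draw(n, rng))
    return total_variation(estimate, TRUE_DIST)


if __name__ == "__main__":
    # For a consistent estimator the error should tend toward zero as n grows.
    for n in (10, 100, 1000, 10000, 100000):
        print(n, round(error_at(n), 4))
```

In this simple setting the error shrinks as the sample grows, which is the behavior consistency asks for. According to the abstract, the analogous relative-frequency approach over DOP (DOP1) does not have this property, which is what motivates the new estimator.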

Related resources

A Consistent and Efficient Estimator for the Data-oriented Parsing Model

Given a sequence of samples from an unknown probability distribution, a statistical estimator aims at providing an approximate guess of the distribution by utilizing statistics from the samples. One desired property of an estimator is that its guess approaches the unknown distribution as the sample sequence grows large. Mathematically speaking, this property is called consistency. This thesis p...

A Consistent Estimator for Uniform Parameter Under Interval Censoring

Censored data are widely used in statistical tests and parameter estimation. In some cases, e.g. medical accidents whose data are not recorded at the time of occurrence, methods such as interval censoring are used. In this paper, interval censoring is applied to a random sample uniformly distributed on the interval (0,θ). A consistent estimator of θ and some asy...

Data-Oriented Parsing

1. A DOP model for phrase-structure trees R. Bod and R. Scha 2. Probability models for DOP R. Bonnema 3. Encoding frequency information in stochastic parsing models 1. Computational complexity of disambiguation under DOP K. Sima'an 2. Parsing DOP with Monte Carlo techniques J. Chappelier and M. Rajman 3. Towards efficient Monte Carlo parsing R. Bonnema 4. Efficient parsing of DOP with PCFG-redu...

Estimation of E(Y) from a Population with Known Quantiles

In this paper, we consider the problem of estimating E(Y) based on a simple random sample when at least one of the population quantiles is known. We propose a stratified estimator of E(Y), and show that it is strongly consistent. We then establish the asymptotic normality of the suggested estimator, and prove that it ...

Back-off as Parameter Estimation for DOP models

Data-Oriented Parsing (DOP) is a probabilistic performance approach to parsing natural language. Several DOP models have been proposed since it was introduced by Scha (1990), achieving promising results. One important feature of these models is the probability estimation procedure. Two major estimators have been put forward: Bod (1993) uses a relative frequency estimator; Bonnema (1999) adds a ...

Journal title:
  • Journal of Automata, Languages and Combinatorics

Volume 10  Issue

Pages  -

Publication date 2005